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Abstract 

Batch effects are responsible for the failure of promising genomic prognos- 
tic signatures, major ambiguities in published genomic results, and retractions of 
widely-publicized findings. Batch effect corrections have been developed to re- 
move these artifacts, but they are designed to be used in population studies. But 
genomic technologies are beginning to be used in clinical applications where sam- 
ples are analyzed one at a time for diagnostic, prognostic, and predictive applica- 
tions. There are currently no batch correction methods that have been developed 
specifically for prediction. In this paper, we propose an new method called frozen 
surrogate variable analysis (fSVA) that borrows strength from a training set for 
individual sample batch correction. We show that fSVA improves prediction ac- 
curacy in simulations and in public genomic studies. fSVA is available as part of 
the sva Bioconductor package, genomics; prediction; batch effects; personalized 
medicine; surrogate variable analysis 



1 Introduction 

Genomic technologies were originally developed and applied for basic science research 
and hypothesis generation [8|. As these technologies mature, they are increasingly be- 
ing used as clinical tools for diagnosis or prognosis Q. The high-dimensional mea- 
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surements made by microarrays can be used to classify patients into predictive, prog- 
nostic, or diagnostic groups. Despite the incredible clinical promise of these technolo- 
gies there have only been a few signatures that have successfully been translated into 
the clinic. 

One of the reasons for the relatively low rate of success is the impact of unmea- 
sured technological or biological confounders. These artifacts are collectively referred 
to as "batch effects" because the processing date, or batch, is the most commonly mea- 
sured surrogate for these unmeasured variables in genomic studies E3l [121 l28l . The 
umbrella term batch effects also refers to any unmeasured variables that can vary from 
experiment to experiment, ranging from the technician who performs the experiment 
to the temperature and ozone levels that day Ifl4l [9] . 

Batch effects are responsible for the failure of promising genomic prognostic sig- 
natures [2 3 1, major ambiguities in published genomic results E5l fTl. and retractions 
of widely-publicized findings l24l [T3l . In many experiments, the signal from these 
unmeasured confounders is larger than the biological signal of interest lfl8l . But the 
impact of batch effects on prediction problems has only recently been demonstrated 
EH [20). Batch effects were also recognized as a significant hurdle in the development 
of personalized genomic biomarkers in the Institute of Medicine's report on clinical 
genomics [21]. 

While a number of methods have been developed for removing batch effects in 
population-based genomic studies lfT2lfTTl[T6l[T8ll28l . there is currently no method for 
removing batch effects for prediction problems. There are two key differences between 
population level corrections and corrections designed for prediction problems. First, 
population level corrections assume that the biological groups of interest are known in 
advance. In prediction problems, the goal is to predict the biological group. Second, 
in prediction problems, new samples are observed one at a time, so the surrogate batch 
variable will have a unique and unknown value for each sample. 
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Here we propose frozen Surrogate Variable Analysis (fS VA) as a method for batch 
correction in prediction problems. fSVA borrows strength from a reference database to 
address the challenges unique to batch correction for prediction. The fSVA approach 
has two main components. First, surrogate variable analysis (S VA) is used to correct for 
batch effects in the training database. Any standard classification algorithm can then 
be applied to build a classifier based on this clean training data set. Second, probability 
weights and coefficients estimated on the training database are used to remove batch 
effects in new samples. The classifier trained on the clean database can then be applied 
to these cleaned samples for prediction. 

We show with simulated data that the fS VA approach leads to substantial improve- 
ment in predictive accuracy when unmeasured variables are correlated with biological 
outcomes. We also apply fSVA to multiple publicly available microarray data sets and 
show improvements in prediction accuracy after correcting for batch. The methods de- 
veloped in this paper have been implemented in the freely available sva Bioconductor 
package. 

2 Frozen Surrogate Variable Analysis methodology 

2.1 Removing batch effects from the training set 

The first step in batch correction for prediction problems is to remove batch effects 
from the training set. In the training set, the biological groups are known. This setting is 
similar to the population genomics setting and we can use a model for gene expression 
data originally developed for population correction of unmeasured confounders. If 
there are m measured features and n samples in the training set, we let X mxn be the 
matrix of feature data, where is the value of feature ;' for sample j. For convenience, 
we will refer to X as the expression matrix for the remainder of the paper. However, our 
methods can be generally applied to any set of features, including measures of protein 
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abundance, gene expression, or DNA methylation. 

We propose a linear model for the relationship between the expression levels and 
the outcome of interest yf. Xij = bo + Dfii Skiyj) + e ij- The are a set of basis 
functions parameterizing the relationship between expression and outcome. If the pre- 
diction problem is two-class, then p\ = 1 and s\ {yj) = 1 (yj = 1) is an indicator function 
that sample j belongs to class one. In a multi-class prediction problem k > 1 and the 
Sjt(-) may represent a factor model for each class. In matrix form this model can be 
written lfT6lfT7l. 

X = BS + E (1) 

where S plXn is a model matrix of p\ biological variables of interest for the n samples, 
Bmxpi are the coefficients for these variables, and E mx „ is the matrix of errors. 

In genomic studies, the error term E is not independent across samples lfT2l[T6l[T7l 
l28l ITOl [181 ITTI . That is, there is still correlation between rows of E after accounting 
for the model S. The correlation is due to unmeasured and unwanted factors such as 
batch. We can modify model [TJ to account for these measured biological factors and 
unmeasured biological and non-biological factors: 

X = BS + FG + U (2) 

where G P2X „ is a pi x n random matrix, called a dependence kernel ifTTIl that pa- 
rameterizes the effect of unmeasured confounders, T mxp2 is the mx pi matrix of co- 
efficients for G, and U mxn is the m x n matrix of independent measurement errors. We 
previously demonstrated that such a decomposition of the variance exists under general 
conditions typically satisfied in population genomic experiments ifTTl . 

In the training set, the biological classes are known, so S is known and fixed. But 
the matrices B, T and G must be estimated. fSVA first performs surrogate variable 
analysis (SVA) on the training database in order to identify surrogates for batch effects 
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in the training samples. The training set can be "cleaned" of batch effects by regressing 
the effect of the surrogate variables out of the data for each feature. Any classification 
algorithm can then be developed on the basis of the clean training data set. 

SVA is an iterative algorithm that alternates between two steps. First SVA esti- 
mates the probabilities 7T, y = Pr(^. ^d\X,S,G),7tib =Vr(bi. ^0|tf. ^0,X,S,G) using 
an empirical Bayes' estimation procedure 1(171 171 1261 . These probabilities are then com- 
bined to define an estimate of the probability that a gene is associated with unmeasured 
confounders, but not with the group outcome 

n iw = Pr(6,. = 0&Ti. f 0\X,S,G) 

= Pv(b,. = 0\ Y i- ¥= 0,X,S,G)Pr(yi. + 0\X,S,G) 
= (l-?ra>)% 

The second step of the SVA algorithm weighs each row of the expression matrix X by 
the corresponding probability weight pi iw and performs a singular value decomposition 
of the weighted matrix. Letting m>„ = 7T 1W the decomposition can be written WX — 
UDV T . After iterating between these two steps, the first p2 weighted left singular 
vectors of X are used as estimates of G. An estimate of p2 can be obtained either 
through permutation |4| or asymptotic ifTSI approaches. 

Once estimates G have been obtained, it is possible to fit the regression model in 
equation |5] using standard least squares. The result are estimates for the coefficients B 
and V. Batch effects can be removed from the training set by setting X cleim = X — FG. 
Any standard prediction algorithm can then be applied to X clecm to develop a classifier 
based on batch-free genomic data. The result is a prediction function f{X c j ean ) that 
predicts the outcome variable yj based on the clean expression matrix. 
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2.2 Removing batch effects from new samples 

Removing batch effects from the training database is accomplished using standard 
population genomic SVA batch correction. But the application of classifiers to new 
genomic samples requires batch correction of individual samples when both the batch 
and outcome variables are unknown. The fSVA algorithm borrows strength from the 
training database to perform this batch correction. 

To remove batch effects from a new sample X.y, it is first appended to the training 
data to create an augmented expression matrix X? — \XX.ji] where [•] denotes con- 
catenation of columns. To estimate the values of G for the new sample, fSVA uses a 
weighted singular value decomposition, using the probability weights estimated from 
the training database WX j ' = U J ' D J 'v '' T . The result is an estimate G>' that includes a 
column for the new sample. 

To remove batch effects from the new sample, fSVA uses the coefficients estimated 
from the training database f and the estimated surrogate variables: x clean ' = X' — 
f &' . If there are n training samples, then the (n + l)st column of X clean i' represents 
the new clean sample. The classifier built on the clean training data can be applied to 
this clean data set to classify the new sample. 

3 Fast fSVA methodology 

fSVA requires that a new singular value decomposition be applied to the augmented 
expression matrix once for each new sample. Although this is somewhat computa- 
tionally intensive, in typical personalized medicine applications, sample collection and 
processing will be spread over a long period of time. In this setting, computational time 
is not of critical concern. However, for evaluating the fSVA methodology or develop- 
ing new classifiers using cross-validation, it is important to be able to quickly calculate 
clean expression values for test samples. 
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We propose an approximate fSVA algorithm that greatly reduces computing time 
by performing a streaming singular value decomposition [29 30]. The basic idea be- 
hind our computation speed-up is to perform the singular value decomposition once 
on the training data, save the left singular vectors and singular values, and use them to 
calculate approximate values for the right singular values in new samples. 

When removing batch effects from the training data, the last step is a weighted 
singular value decomposition of the training expression matrix WX = UDV T . After 
convergence, the first p2 columns of the matrix V are the surrogate variables for the 
training set. Since U and V are orthonormal matrices, we can write V T = D~ l U T WX . 
The matrix P — D 1 U T W projects the columns of X onto the right singular vectors V T . 
Pre-multiplying a set of new samples X" ew by P results in an estimate of the singular 
values for the new samples: V Tnew = P T X new , The surrogate variable estimates for 
the new samples consist of the first p2 columns of y Tnew . We obtain clean data for 
the new samples using the estimated coefficients from the training set, identical to the 
calculation for the exact fS VA algorithm: x dean '" ew = X new - f6" ew . 

Estimates obtained using this approximate algorithm are not identical to those ob- 
tained using the exact fSVA algorithm. The projection matrix used in the approxima- 
tion, P T , is calculated using only the samples in the training set. However, there is only 
a one-sample difference between the projection calculated in the training set and the 
projection that would be obtained with exact fSVA. As the training set size grows, the 
approximation is closer and closer to the answer that would be obtained from the ex- 
act algorithm. For smaller databases, there is less computational burden in calculating 
the exact estimates. However, for large training sets, the computational savings can be 
dramatic, as described in the simulation below. 
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4 Simulation Results 



We performed a simulation examining the benefit of fSVA in prediction problems. In 
order to do this, we simulated data using equation [2] under different distributions of 
each parameter. We also discretized the probability weights n\y and and varied 
the distribution of these probability weights (Table|4]i. We crafted these simulations to 
mimic scenarios with a subtle outcome and a strong batch effect, which is frequently 
the case in genomic data. 





Parameter Distributions 


Affected Features 




Br- 


-tf(0,l) 


50% batch-affected 


Scenario 1 


]> 


-tf(0,3) 


50% outcome-affected 




U r 


~N(Q,2) 


40% affected by batch and outcome 






-tf(0,l) 


50% batch-affected 


Scenario 2 


Tr 


-tf(0,4) 


50% outcome-affected 




U r 


^N(0,3) 


40% affected by batch and outcome 




Br- 


-N(0,1) 


80% batch-affected 


Scenario 3 


V- 


-tf(0,4) 


80% outcome-affected 




U r 


~N(0,3) 


50% affected by batch and outcome 



Table 1 : Specifications for the three simulation scenarios used to show the perfor- 
mance of fSVA. We performed three simulations under slightly different parameteriza- 
tions to show the effectiveness of fSVA in improving prediction accuracy. Parameters 
from equation[2]were simulated using the distributions specified in this table. Addition- 
ally, the percentage of features in the simulation affected by batch, outcome, or both 
are as indicated in this table. Results from these simulations can be found in Figure[T] 



We also specified that each data set have two batches and two outcomes, each 
equally distributed. We varied the amount of confounding between batch and outcome 
in the simulated database, from a Pearson's correlation of to a correlation of over 0.90. 
For the simulated new samples, the batch and outcome were always uncorrelated. 

We simulated 100 database samples and 100 new samples using the parameters 
described above. Each sample had 10000 features. As a control, for each iteration in 
addition to performing fSVA correction, we performed SVA correction on the simu- 
lated database alone, and also performed prediction with no batch correction on the 
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simulated database or new samples. 

To quantify the effect that fSVA had on prediction, we performed exact fSVA as 
described above on the simulated database and new samples. We then performed Pre- 
diction Analysis of Microarrays (PAM), a commonly used method for classifying mi- 
croarrays J27). The PAM prediction model was built on the SVA-corrected database, 
and then used to predict the outcomes on the f SVA-corrected new samples. Each sim- 
ulation was repeated 100 times for robustness. 

We found that in general the prediction accuracy measures for different iterations 
of a simulation varied highly, but the ordinality remained relatively constant. Therefore 
to display results we randomly selected three graphs from each of the scenarios, using 
the sample function in R (Figure [TJ. Each of the graphs from the 100 iteractions for 
each scenario can be found on the author's website. 
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Scenario 1 





\ 1 1 r 

0.0 0.2 0.4 0.6 OS 



i 1 1 r 

0.0 0.2 0.4 0.6 0.8 



Scenario 2 




\ 1 1 r 

0.0 0.2 0.4 0.6 0.8 




\ 1 1 r 

0.0 0.2 0.4 O.S OS 




\ 1 1 r 

0.0 0.2 0.4 0.6 0.8 



Scenario 3 



i 1 1 1 r 

0.0 0.2 0.4 0.6 08 




\ 1 1 r 

0.0 0.2 0.4 0.6 



i 1 1 r 

0.0 0.2 0.4 0.6 0.8 



Legend 



No Correction 
SVA only Correction 
exact fSVA Correction 
fast fSVA Correction 



Figure 1 : fSVA improves prediction accuracy of simulated datasets. We created 
simulated datasets (consisting of a database and new samples) using model[2]and tested 
the prediction accuracy of these using R. For each simulated data set we performed ei- 
ther exact fS VA correction, fast fS VA correction, SVA correction on the database only, 
or no correction. We performed 100 iterations on each simulation scenario described 
in table |4] Using the sample function in the R package, we randomly chose three 
iterations from each of the three scenarios to display here. 



We found that fSVA improved the prediction accuracy in all of our simulations 
(Figure [TJ. Interestingly, exact fSVA generally outperformed fast fSVA at all of the 
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correlation levels except the highest correlation levels. However both fSVA methods 
out-performed our control of performing SVA on the database alone. Additionally any 
method of batch-correction generally outperformed no batch correction whatsoever. 

When the batch and outcome were not correlated, we saw ambiguous performance 
from using fSVA. This is not unexpected since it has been shown that in scenarios with 
no confounding between batch and outcome, batch has a minimal effect on prediction 
accuracy l22l . When databases had extreme confounding between batch and outcome 
(above 0.85) we saw the benefits of all the batch-correction methods drop off. This is 
because in these situations, SVA on the database cannot differentiate batch and outcome 
in the database. 

While in each of the simulations there was an accuracy cost to using fast fSVA 
versus exact fSVA, the computational time savings was dramatic. In the scenario de- 
scribed, with 100 samples in the database and 100 new samples, the wall-clock com- 
putational time using a standard desktop computer for exact fSVA was 133.9 seconds, 
versus just 1.3 seconds for fast fSVA. Using 50 samples in the database and 50 new 
samples, exact fSVA required 17.9 seconds versus 0.4 seconds for fast fSVA. We en- 
courage users to consider both the accuracy and computational times when selecting 
which algorithm to use for a particular data set. 

5 Results from Microarray Studies 

We examined the effect that fsva had on several microarray studies, obtained from the 
Gene Expression Omnibus (GEO) website J6j. All except three of the studies were 
preprocessed/standardized as described previously. Three of the studies (GSE2034, 
GSE2603, GSE2990) were obtained from GEO and fRMA-normalized. 

Each of the studies was randomly divided into equally-sized "database" and "new 
sample" subsets. We SVA-corrected the database subset, and then built a predictive 
model (PAM) on that corrected data. We then performed fSVA correction on the new 
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samples. After performing fSVA correction, we measured the prediction accuracy of 
the model built on the database. This process was iterated 100 times for each study 
to obtain confidence intervals. This method is virtually identical to the simulation 
described above. 

Results from this process can be found below (Table[5]l. Five of the studies showed 
significant improvement using fSVA. One study showed marginal improvement, with 
its 95% confidence interval overlapped zero. Three studies showed a cost to using 
fSVA, though in all three cases the 95% confidence interval for the true cost overlapped 
zero. 



Study Name No Correction (CI) Exact fSVA (CI) Improvement (CI) 



GSE 10927 


0.97 (0.96, 


0.97) 


0.98 


(0.98, 


0.99) 


0.02 (0.01, 0.02) 


GSE 13041 


0.61 (0.59, 


0.63) 


0.69 


(0.66, 


0.71) 


0.07 (0.05, 0.10) 


GSE13911 


0.93 (0.93, 


0.94) 


0.94 


(0.94, 


0.95) 


0.01 (0.00, 0.01) 


GSE2034 


0.51 (0.49, 


0.52) 


0.54 


(0.52, 


0.56) 


0.03 (0.01, 0.05) 


GSE2603 


0.68 (0.66, 


0.70) 


0.66 


(0.64, 


0.67) 


-0.02 (-0.04, 0.00) 


GSE2990 


0.59 (0.58, 


0.61) 


0.58 


(0.56, 


0.60) 


-0.02 (-0.04, 0.00) 


GSE4183 


0.89 (0.88, 


0.91) 


0.88 


(0.86, 


0.89) 


-0.02 (-0.03, 0.00) 


GSE6764 


0.74 (0.72, 


0.76) 


0.75 


(0.72, 


0.77) 


0.01 (-0.01, 0.04) 


GSE7696 


0.78 (0.76, 


0.79) 


0.80 


(0.78, 


0.81) 


0.02 (0.01, 0.04) 



Table 2: fSVA improves prediction accuracy in 5 of the 9 studies examined. The 

remaining 4 studies showed indeterminate results since the 95% confidence intervals 
overlapped zero. In order to find the prediction accuracy results, each of the studies 
was randomly divided into "database samples" and "new samples". fSVA-correction 
was then performed as described above. We then built a predictive model (PAM) on 
the database and tested the prediction accuracy on the new samples. 



6 Conclusions 

Batch effects have been recognized as a crucial hurdle for population genomics ex- 
periments [18, 22 J. They have also been recognized as a critical hurdle in developing 
genomic signatures for personalized medicine EH . Here we have introduced the first 
batch correction method specifically developed for prediction problems. Our approach 
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borrows strength from a training set to infer and remove batch effects in individual 
clinical samples. 

We have demonstrated the power of our approach in both simulated and real gene 
expression microarray data. However, our approach depends on similarity between the 
training set and the test samples, both in terms of the genes affected and the estimated 
coefficients. In small training sets, these assumptions may be violated. Similarly, 
training sets that show near perfect correlation between batch variables and biological 
classes represent an extreme case that can not be directly corrected using fSVA. An 
interesting avenue for future research is the use of publicly available microarray data 
to build increasingly large training databases for batch removal. 

The methods we have developed here are available as part of the sva Bioconductor 
package Ifl9ll . 
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